We’ll use data from http://www.tfl.gov.uk to analyse usage of the London
Bike Sharing scheme. This data has already been downloaded for you and
exists in a CSV (Comma Separated Values) file that you have
to read in to R.
There is no dropdown menu to read in your data to R, so instead we
use functions like read_csv to load data from file into R
objects.
Sometimes our data needs a bit of ‘cleaning’. For instance,
day_of_week is variable type character, or
chr. We should, however, treat it as a categorical, or
factor variable and relevel it, so Monday is the first
level of the factor (or first day of week), etc.
R is fairly sensitive with dates. When you read a CSV file, the date
may be in different formats. For instance, Christmas 2017 could be input
as 12-25-2017, 25.12.2017, 25 Dec 2017, Dec 25, 2017, etc. To be
consistent, we use lubridate’s ymd function, and force
variable Day to be a date in the format YYYY-MM-DD
Finally, we can turn season from 1, 2, 3, 4, to words
like Winter, Spring, etc.
We will be talking more about data wrangling later, but for now just execute the following piece of code.
# fix dates using lubridate, and generate new variables for year, month, month_name, day, and day_of _week
bike <- bike %>%
mutate(
year=year(date),
month = month(date),
month_name=month(date, label = TRUE),
day_of_week = wday(date, label = TRUE),
wday = as.character.factor(day_of_week),
weekend = if_else((day_of_week == "Sat" | day_of_week == "Sun"), TRUE, FALSE))
# generate new variable season_name to turn seasons from numbers to Winter, Spring, etc
bike <- bike %>%
mutate(
season_name = case_when(
month_name %in% c("Dec", "Jan", "Feb") ~ "Winter",
month_name %in% c("Mar", "Apr", "May") ~ "Spring",
month_name %in% c("Jun", "Jul", "Aug") ~ "Summer",
month_name %in% c("Sep", "Oct", "Nov") ~ "Autumn",
),
season_name = factor(season_name,
levels = c("Winter", "Spring", "Summer", "Autumn")))
#examine the structure of the datafame
skim(bike)| Name | bike |
| Number of rows | 5084 |
| Number of columns | 40 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| factor | 3 |
| logical | 2 |
| numeric | 26 |
| POSIXct | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| preciptype | 1346 | 0.74 | 4 | 9 | 0 | 3 | 0 |
| conditions | 0 | 1.00 | 4 | 28 | 0 | 11 | 0 |
| description | 0 | 1.00 | 26 | 85 | 0 | 78 | 0 |
| icon | 0 | 1.00 | 4 | 17 | 0 | 6 | 0 |
| set | 0 | 1.00 | 5 | 5 | 0 | 1 | 0 |
| wday | 0 | 1.00 | 3 | 3 | 0 | 7 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| month_name | 0 | 1 | TRUE | 12 | Jan: 434, Mar: 434, May: 434, Aug: 434 |
| day_of_week | 0 | 1 | TRUE | 7 | Fri: 727, Sat: 727, Sun: 726, Mon: 726 |
| season_name | 0 | 1 | FALSE | 4 | Spr: 1288, Aut: 1274, Win: 1264, Sum: 1258 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| severerisk | 5084 | 0 | NaN | : |
| weekend | 0 | 1 | 0.29 | FAL: 3631, TRU: 1453 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bikes_hired | 0 | 1.00 | 26370.17 | 9593.67 | 0.0 | 19616.00 | 26047.50 | 32985.75 | 73094.00 | ▂▇▆▁▁ |
| tempmax | 0 | 1.00 | 14.64 | 6.36 | -2.6 | 10.00 | 14.30 | 19.40 | 36.10 | ▁▇▇▃▁ |
| tempmin | 0 | 1.00 | 7.73 | 5.02 | -11.1 | 4.00 | 7.80 | 11.50 | 22.10 | ▁▃▇▇▁ |
| temp | 0 | 1.00 | 11.13 | 5.47 | -5.9 | 7.10 | 11.00 | 15.40 | 28.30 | ▁▅▇▅▁ |
| feelslikemax | 0 | 1.00 | 13.98 | 7.24 | -7.2 | 10.00 | 14.30 | 19.40 | 38.20 | ▁▅▇▃▁ |
| feelslikemin | 0 | 1.00 | 6.04 | 6.36 | -14.6 | 1.00 | 5.80 | 11.50 | 22.10 | ▁▅▇▇▂ |
| feelslike | 0 | 1.00 | 9.97 | 6.67 | -9.1 | 4.80 | 10.50 | 15.33 | 28.90 | ▁▆▇▇▁ |
| dew | 0 | 1.00 | 7.33 | 4.67 | -8.0 | 4.00 | 7.50 | 10.90 | 19.00 | ▁▃▇▇▂ |
| humidity | 0 | 1.00 | 79.34 | 10.43 | 39.6 | 72.30 | 80.30 | 87.30 | 99.90 | ▁▂▅▇▅ |
| precip | 0 | 1.00 | 1.64 | 4.75 | 0.0 | 0.00 | 0.19 | 1.38 | 168.20 | ▇▁▁▁▁ |
| precipprob | 0 | 1.00 | 73.35 | 44.22 | 0.0 | 0.00 | 100.00 | 100.00 | 100.00 | ▃▁▁▁▇ |
| precipcover | 0 | 1.00 | 8.67 | 10.97 | 0.0 | 0.00 | 8.33 | 12.50 | 100.00 | ▇▁▁▁▁ |
| snow | 0 | 1.00 | 0.02 | 0.25 | 0.0 | 0.00 | 0.00 | 0.00 | 8.40 | ▇▁▁▁▁ |
| snowdepth | 0 | 1.00 | 0.05 | 0.56 | 0.0 | 0.00 | 0.00 | 0.00 | 18.10 | ▇▁▁▁▁ |
| windgust | 1126 | 0.78 | 42.50 | 14.80 | 9.4 | 31.30 | 41.40 | 52.08 | 120.90 | ▅▇▃▁▁ |
| windspeed | 0 | 1.00 | 23.06 | 8.44 | 5.1 | 16.90 | 21.90 | 27.80 | 72.20 | ▅▇▂▁▁ |
| winddir | 0 | 1.00 | 196.67 | 92.57 | 0.1 | 126.47 | 221.40 | 258.83 | 359.80 | ▃▂▃▇▃ |
| sealevelpressure | 0 | 1.00 | 1014.86 | 10.55 | 966.6 | 1008.60 | 1015.80 | 1022.00 | 1047.80 | ▁▁▇▇▁ |
| cloudcover | 0 | 1.00 | 60.74 | 22.23 | 0.0 | 46.70 | 62.70 | 77.53 | 100.00 | ▁▃▇▇▅ |
| visibility | 0 | 1.00 | 22.93 | 10.67 | 0.2 | 15.50 | 22.40 | 29.10 | 62.50 | ▃▇▆▁▁ |
| solarradiation | 0 | 1.00 | 113.71 | 85.28 | 0.0 | 41.18 | 93.60 | 172.52 | 358.60 | ▇▅▃▂▁ |
| solarenergy | 0 | 1.00 | 9.81 | 7.37 | 0.0 | 3.50 | 8.10 | 14.90 | 31.00 | ▇▅▃▂▁ |
| uvindex | 0 | 1.00 | 4.29 | 2.54 | 0.0 | 2.00 | 4.00 | 6.00 | 10.00 | ▇▆▅▅▁ |
| moonphase | 0 | 1.00 | 0.48 | 0.29 | 0.0 | 0.25 | 0.49 | 0.75 | 0.98 | ▇▇▇▇▇ |
| year | 0 | 1.00 | 2017.04 | 4.04 | 2010.0 | 2014.00 | 2017.00 | 2021.00 | 2024.00 | ▆▇▇▇▆ |
| month | 0 | 1.00 | 6.52 | 3.46 | 1.0 | 4.00 | 7.00 | 10.00 | 12.00 | ▇▅▅▅▇ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2010-07-30 00:00:00 | 2024-06-29 00:00:00 | 2017-07-14 12:00:00 | 5084 |
| sunrise | 0 | 1 | 2010-07-30 05:17:39 | 2024-06-29 04:46:46 | 2017-07-14 16:57:02 | 5084 |
| sunset | 0 | 1 | 2010-07-30 20:50:39 | 2024-06-29 21:21:22 | 2017-07-15 09:10:33 | 5084 |
Besides the number of bikes hired each day, we also have data on the weather for that day
Having loaded and cleaned our data, we can create summary statistics
using mosaic’s favstats. This is uni-variate analysis, so
in our mosaic syntax we will use
favstats(~ bikes_hired, data= bike). We also want to get an
overall, time-series plot of bikes over time; for the latter, we just
create a scatter plot of bikes_hired vs
Day.
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.96e+04 | 2.6e+04 | 3.3e+04 | 7.31e+04 | 2.64e+04 | 9.59e+03 | 5084 | 0 |
While this analysis shows us overall statistics, what if we wanted to
get summary statistics by year, day_of_week,
month, or season? Mosaic’s syntax
Y ~ X alows us to facet our analysis of a variable Y by
another variable X using the syntax
favstats( Y ~ Z, data=...)
| year | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| 2010 | 2.76e+03 | 9.3e+03 | 1.4e+04 | 1.87e+04 | 2.8e+04 | 1.41e+04 | 5.62e+03 | 155 | 0 |
| 2011 | 4.56e+03 | 1.63e+04 | 2.03e+04 | 2.37e+04 | 2.94e+04 | 1.96e+04 | 5.5e+03 | 365 | 0 |
| 2012 | 3.53e+03 | 1.93e+04 | 2.62e+04 | 3.25e+04 | 4.71e+04 | 2.6e+04 | 9.43e+03 | 366 | 0 |
| 2013 | 3.73e+03 | 1.76e+04 | 2.2e+04 | 2.74e+04 | 3.56e+04 | 2.2e+04 | 7.28e+03 | 365 | 0 |
| 2014 | 4.33e+03 | 2.05e+04 | 2.77e+04 | 3.44e+04 | 4.9e+04 | 2.75e+04 | 9.07e+03 | 365 | 0 |
| 2015 | 5.78e+03 | 2.21e+04 | 2.66e+04 | 3.29e+04 | 7.31e+04 | 2.7e+04 | 8.55e+03 | 365 | 0 |
| 2016 | 4.89e+03 | 2.24e+04 | 2.79e+04 | 3.51e+04 | 4.66e+04 | 2.82e+04 | 8.85e+03 | 366 | 0 |
| 2017 | 5.14e+03 | 2.41e+04 | 2.95e+04 | 3.45e+04 | 4.6e+04 | 2.86e+04 | 8.38e+03 | 365 | 0 |
| 2018 | 5.86e+03 | 2.18e+04 | 2.92e+04 | 3.77e+04 | 4.61e+04 | 2.9e+04 | 1.02e+04 | 365 | 0 |
| 2019 | 5.65e+03 | 2.42e+04 | 2.89e+04 | 3.45e+04 | 4.47e+04 | 2.86e+04 | 8.09e+03 | 365 | 0 |
| 2020 | 4.87e+03 | 2.01e+04 | 2.75e+04 | 3.65e+04 | 7.02e+04 | 2.85e+04 | 1.16e+04 | 366 | 0 |
| 2021 | 6.25e+03 | 2.17e+04 | 3.1e+04 | 3.82e+04 | 5.69e+04 | 3e+04 | 1.1e+04 | 365 | 0 |
| 2022 | 0 | 2.51e+04 | 3.14e+04 | 4e+04 | 6.7e+04 | 3.15e+04 | 1.03e+04 | 365 | 0 |
| 2023 | 6.72e+03 | 1.99e+04 | 2.37e+04 | 2.78e+04 | 3.53e+04 | 2.34e+04 | 6.03e+03 | 365 | 0 |
| 2024 | 7.48e+03 | 1.85e+04 | 2.29e+04 | 2.71e+04 | 3.52e+04 | 2.28e+04 | 5.91e+03 | 181 | 0 |
| day_of_week | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Sun | 0 | 1.41e+04 | 2.1e+04 | 2.99e+04 | 6.31e+04 | 2.23e+04 | 1.06e+04 | 726 | 0 |
| Mon | 3.97e+03 | 2.06e+04 | 2.58e+04 | 3.17e+04 | 6.7e+04 | 2.61e+04 | 8.37e+03 | 726 | 0 |
| Tue | 3.76e+03 | 2.24e+04 | 2.78e+04 | 3.46e+04 | 6.53e+04 | 2.82e+04 | 8.71e+03 | 726 | 0 |
| Wed | 4.33e+03 | 2.26e+04 | 2.79e+04 | 3.44e+04 | 5.44e+04 | 2.83e+04 | 8.65e+03 | 726 | 0 |
| Thu | 5.65e+03 | 2.25e+04 | 2.73e+04 | 3.46e+04 | 7.31e+04 | 2.83e+04 | 8.88e+03 | 726 | 0 |
| Fri | 5.4e+03 | 2.14e+04 | 2.65e+04 | 3.32e+04 | 6.7e+04 | 2.71e+04 | 8.64e+03 | 727 | 0 |
| Sat | 0 | 1.6e+04 | 2.25e+04 | 3.13e+04 | 7.02e+04 | 2.43e+04 | 1.12e+04 | 727 | 0 |
| month_name | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Jan | 3.73e+03 | 1.39e+04 | 1.9e+04 | 2.33e+04 | 3.8e+04 | 1.87e+04 | 5.99e+03 | 434 | 0 |
| Feb | 3.53e+03 | 1.6e+04 | 2.06e+04 | 2.46e+04 | 5.25e+04 | 2.03e+04 | 6.37e+03 | 396 | 0 |
| Mar | 5.06e+03 | 1.76e+04 | 2.31e+04 | 2.72e+04 | 5.66e+04 | 2.27e+04 | 7.59e+03 | 434 | 0 |
| Apr | 4.87e+03 | 2.1e+04 | 2.59e+04 | 3.08e+04 | 4.9e+04 | 2.6e+04 | 7.61e+03 | 420 | 0 |
| May | 1.07e+04 | 2.45e+04 | 2.99e+04 | 3.6e+04 | 7.02e+04 | 3.04e+04 | 8.44e+03 | 434 | 0 |
| Jun | 6.06e+03 | 2.77e+04 | 3.31e+04 | 3.92e+04 | 6.53e+04 | 3.34e+04 | 8.59e+03 | 419 | 0 |
| Jul | 5.56e+03 | 2.99e+04 | 3.69e+04 | 4.17e+04 | 7.31e+04 | 3.52e+04 | 8.37e+03 | 405 | 0 |
| Aug | 4.3e+03 | 2.55e+04 | 3.33e+04 | 3.89e+04 | 6.7e+04 | 3.16e+04 | 9.83e+03 | 434 | 0 |
| Sep | 0 | 2.51e+04 | 3.18e+04 | 3.65e+04 | 5.18e+04 | 3.06e+04 | 8.26e+03 | 420 | 0 |
| Oct | 7.07e+03 | 2.33e+04 | 2.8e+04 | 3.27e+04 | 4.79e+04 | 2.75e+04 | 6.9e+03 | 434 | 0 |
| Nov | 6.03e+03 | 1.87e+04 | 2.37e+04 | 2.8e+04 | 4.47e+04 | 2.32e+04 | 6.52e+03 | 420 | 0 |
| Dec | 2.76e+03 | 1.16e+04 | 1.66e+04 | 2.29e+04 | 3.91e+04 | 1.71e+04 | 6.94e+03 | 434 | 0 |
| season_name | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Winter | 2.76e+03 | 1.38e+04 | 1.88e+04 | 2.35e+04 | 5.25e+04 | 1.86e+04 | 6.58e+03 | 1264 | 0 |
| Spring | 4.87e+03 | 2.08e+04 | 2.61e+04 | 3.13e+04 | 7.02e+04 | 2.64e+04 | 8.51e+03 | 1288 | 0 |
| Summer | 4.3e+03 | 2.77e+04 | 3.43e+04 | 3.99e+04 | 7.31e+04 | 3.34e+04 | 9.08e+03 | 1258 | 0 |
| Autumn | 0 | 2.18e+04 | 2.74e+04 | 3.29e+04 | 5.18e+04 | 2.71e+04 | 7.86e+03 | 1274 | 0 |
While summary statistics allow us to quickly disover key metrics that represent the data, they are often not sufficient by themselves. Often, it is useful to represent the data graphically (or visually) to uncover information that is critical for decision making, such as trends, patterns and outliers.
In this section we will create a time-series scatter-plot that shows
the number of bikes hired on each day. We will use the
ggplot2 library.
Creating a basic plot is a two-step process. The first step is the
specify the data frame and the axes by using the syntax:
ggplot(data frame, aes(x=variable, y=variable). The second
step is to specify the plot that we want. In this case, we want a
scatter (point) plot using geom_point() after a
+ sign.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# Histogram faceted by month_name in 4 rows
ggplot(bike, aes(x=bikes_hired))+
geom_histogram()+
facet_wrap(~month_name, nrow = 4)+
theme_bw()+
NULL## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Boxplot of bikes_hired by month_name
ggplot(bike, aes(x=month_name, y= bikes_hired))+
geom_boxplot()+
theme_bw()+
NULL# bikes_hired vs. weather features
ggplot(bike, aes(x=temp, y= bikes_hired))+
geom_point()+
geom_smooth(method = "lm", se=FALSE)+
theme_bw()+
NULL## `geom_smooth()` using formula = 'y ~ x'
ggplot(bike, aes(x=temp, y= bikes_hired, colour=season_name))+
geom_point(alpha = 0.2)+
geom_smooth(method = "lm", se=FALSE)+
theme_bw()+
# facet_wrap(~season_name, ncol=1)+
NULL## `geom_smooth()` using formula = 'y ~ x'
# bikes on humidity
ggplot(bike, aes(x=humidity, y= bikes_hired))+
geom_point()+
geom_smooth(method = "lm")+
theme_bw()+
NULL## `geom_smooth()` using formula = 'y ~ x'
Besides the number of bikes rented out on a given day, we have also
downloaded weather data for London such as temp,
humidity, sealevelpressure,
precip, etc. that measure weather conditions on a single
day. It may be the case that more bikes are rented out when it’s warmer?
Or how can we estimate the effect of rain on rentals?
Your task is to build a regression model that helps you explain the number of rentals per day.
Let us select a few of these numerical variables and create a scatterplot-correlation matrix
bike %>%
select(cloudcover, humidity, sealevelpressure, solarradiation, precip, solarradiation, temp, bikes_hired) %>%
GGally::ggpairs()##
## Welch Two Sample t-test
##
## data: bikes_hired by weekend
## t = 13.355, df = 2218.9, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## 3665.004 4926.609
## sample estimates:
## mean in group FALSE mean in group TRUE
## 27597.91 23302.10
bikes_hiredWe start the naive model where we just use the average to predict how many bikes we are going to rent out on a single day
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.96e+04 | 2.6e+04 | 3.3e+04 | 7.31e+04 | 2.64e+04 | 9.59e+03 | 5084 | 0 |
# can you create a confidence interval for mean bikes_hired? What is the SE?
model0 <- lm(bikes_hired ~ 1, data= bike)
msummary(model0)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26370.2 134.5 196 <0.0000000000000002 ***
##
## Residual standard error: 9594 on 5083 degrees of freedom
# plot actual data vs predicted data
broom::augment(model0) |>
mutate(day=row_number()) |>
ggplot()+
aes(x=day, y = bikes_hired)+
geom_point(alpha = 0.2) +
geom_point(aes(y = .fitted),
shape = 1,
colour = "red", alpha = 0.2)+
theme_bw()What is the regression’s residual standard error? What is the intercept standard error?
bikes_hired on temp# Define the model
model1 <- lm(bikes_hired ~ temp, data = bike)
# look at model estimated
msummary(model1)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14506.38 242.30 59.87 <0.0000000000000002 ***
## temp 1065.60 19.53 54.55 <0.0000000000000002 ***
##
## Residual standard error: 7619 on 5082 degrees of freedom
## Multiple R-squared: 0.3693, Adjusted R-squared: 0.3692
## F-statistic: 2976 on 1 and 5082 DF, p-value: < 0.00000000000000022
# plot actual data vs predicted data
broom::augment(model1) |>
mutate(day=row_number()) |>
ggplot()+
aes(x=day, y = bikes_hired)+
geom_point(alpha = 0.2) +
geom_point(aes(y = .fitted),
shape = 1,
colour = "red", alpha = 0.2)+
theme_bw()temp significant? Why?bikes_hired on temp plus
weekend# Define the model
model2 <- lm(bikes_hired ~ temp + weekend, data = bike)
# look at model estimated
msummary(model2)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15734.9 243.6 64.59 <0.0000000000000002 ***
## temp 1064.5 18.9 56.31 <0.0000000000000002 ***
## weekendTRUE -4254.4 228.9 -18.59 <0.0000000000000002 ***
##
## Residual standard error: 7374 on 5081 degrees of freedom
## Multiple R-squared: 0.4095, Adjusted R-squared: 0.4093
## F-statistic: 1762 on 2 and 5081 DF, p-value: < 0.00000000000000022
# plot actual data vs predicted data
broom::augment(model2) |>
mutate(day=row_number()) |>
ggplot()+
aes(x=day, y = bikes_hired)+
geom_point(alpha = 0.2) +
geom_point(aes(y = .fitted),
shape = 1,
colour = "red", alpha = 0.2)+
theme_bw()Fit a regression model called model2 with the
following explanatory variables: temp and
weekend
temp significant? Why?weekendTRUE?
What % of the variability of bikes_hired does your model explain?bikes_hired on temp plus
wday# Define the model
model3 <- lm(bikes_hired ~ temp + wday, data = bike)
# look at model estimated
msummary(model3)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15189.80 342.98 44.288 < 0.0000000000000002 ***
## temp 1063.56 18.77 56.650 < 0.0000000000000002 ***
## wdayMon -859.42 384.25 -2.237 0.02535 *
## wdaySat -2716.08 384.11 -7.071 0.00000000000175 ***
## wdaySun -4683.71 384.25 -12.189 < 0.0000000000000002 ***
## wdayThu 1199.99 384.25 3.123 0.00180 **
## wdayTue 1186.82 384.25 3.089 0.00202 **
## wdayWed 1249.24 384.25 3.251 0.00116 **
##
## Residual standard error: 7323 on 5076 degrees of freedom
## Multiple R-squared: 0.4181, Adjusted R-squared: 0.4173
## F-statistic: 521 on 7 and 5076 DF, p-value: < 0.00000000000000022
# plot actual data vs predicted data
broom::augment(model3) |>
mutate(day=row_number()) |>
ggplot()+
aes(x=day, y = bikes_hired)+
geom_point(alpha = 0.2) +
geom_point(aes(y = .fitted),
shape = 1,
colour = "red", alpha = 0.2)+
theme_bw()What is the meaning of the effect (slope) of, e.g.,
wdayMon? What % of the variability of bikes_hired does your
model explain?
model4 <- lm(bikes_hired ~ temp + wday + humidity + factor(month), data = bike)
# look at model estimated
msummary(model4)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38655.89 1086.73 35.571 < 0.0000000000000002 ***
## temp 549.73 31.96 17.202 < 0.0000000000000002 ***
## wdayMon -935.14 352.69 -2.651 0.008039 **
## wdaySat -2659.04 352.55 -7.542 0.000000000000054446 ***
## wdaySun -4792.36 352.69 -13.588 < 0.0000000000000002 ***
## wdayThu 1256.27 352.67 3.562 0.000371 ***
## wdayTue 1262.61 352.69 3.580 0.000347 ***
## wdayWed 1364.43 352.69 3.869 0.000111 ***
## humidity -259.36 11.24 -23.079 < 0.0000000000000002 ***
## factor(month)2 414.96 468.66 0.885 0.375975
## factor(month)3 1440.90 463.30 3.110 0.001881 **
## factor(month)4 1959.58 489.46 4.004 0.000063291875370202 ***
## factor(month)5 4374.43 522.28 8.376 < 0.0000000000000002 ***
## factor(month)6 5657.20 573.11 9.871 < 0.0000000000000002 ***
## factor(month)7 5748.44 617.19 9.314 < 0.0000000000000002 ***
## factor(month)8 2975.77 599.64 4.963 0.000000718429322077 ***
## factor(month)9 4523.38 555.88 8.137 0.000000000000000504 ***
## factor(month)10 4728.24 505.28 9.358 < 0.0000000000000002 ***
## factor(month)11 3554.63 470.98 7.547 0.000000000000052380 ***
## factor(month)12 -1474.35 458.34 -3.217 0.001305 **
##
## Residual standard error: 6721 on 5064 degrees of freedom
## Multiple R-squared: 0.511, Adjusted R-squared: 0.5092
## F-statistic: 278.5 on 19 and 5064 DF, p-value: < 0.00000000000000022
# plot actual data vs predicted data
broom::augment(model4) |>
mutate(day=row_number()) |>
ggplot()+
aes(x=day, y = bikes_hired)+
geom_point(alpha = 0.2) +
geom_point(aes(y = .fitted),
shape = 1,
colour = "red", alpha = 0.2)+
theme_bw()model5 <- lm(bikes_hired ~ temp + wday + tempmin + tempmax + feelslike + solarradiation +
humidity + factor(month) + solarradiation + precip, data = bike)
# look at model estimated
msummary(model5)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30697.889 1382.125 22.211 < 0.0000000000000002 ***
## temp -707.761 244.946 -2.889 0.003875 **
## wdayMon -940.524 334.977 -2.808 0.005008 **
## wdaySat -2629.407 334.918 -7.851 0.00000000000000501 ***
## wdaySun -4835.004 335.154 -14.426 < 0.0000000000000002 ***
## wdayThu 1289.848 334.966 3.851 0.000119 ***
## wdayTue 1249.543 335.123 3.729 0.000195 ***
## wdayWed 1393.875 334.995 4.161 0.00003222801551785 ***
## tempmin -398.679 100.523 -3.966 0.00007408336651012 ***
## tempmax 808.435 106.997 7.556 0.00000000000004916 ***
## feelslike 666.935 122.321 5.452 0.00000005207070115 ***
## solarradiation 2.924 1.855 1.577 0.114952
## humidity -165.448 12.165 -13.600 < 0.0000000000000002 ***
## factor(month)2 635.449 448.721 1.416 0.156798
## factor(month)3 637.421 459.093 1.388 0.165065
## factor(month)4 366.953 522.547 0.702 0.482562
## factor(month)5 2828.429 578.623 4.888 0.00000104903019514 ***
## factor(month)6 4232.615 631.773 6.700 0.00000000002317766 ***
## factor(month)7 4736.883 666.070 7.112 0.00000000000130611 ***
## factor(month)8 2071.534 629.229 3.292 0.001001 **
## factor(month)9 3046.323 563.545 5.406 0.00000006753898883 ***
## factor(month)10 3790.875 498.740 7.601 0.00000000000003483 ***
## factor(month)11 3125.722 450.053 6.945 0.00000000000425592 ***
## factor(month)12 -1451.759 435.445 -3.334 0.000862 ***
## precip -246.751 19.696 -12.528 < 0.0000000000000002 ***
##
## Residual standard error: 6383 on 5059 degrees of freedom
## Multiple R-squared: 0.5594, Adjusted R-squared: 0.5573
## F-statistic: 267.6 on 24 and 5059 DF, p-value: < 0.00000000000000022
# plot actual data vs predicted data
broom::augment(model5) |>
mutate(day=row_number()) |>
ggplot()+
aes(x=day, y = bikes_hired)+
geom_point(alpha = 0.2) +
geom_point(aes(y = .fitted),
shape = 1,
colour = "red", alpha = 0.2)+
theme_bw()## GVIF Df GVIF^(1/(2*Df))
## temp 224.071749 1 14.969026
## wday 1.006055 6 1.000503
## tempmin 31.817179 1 5.640672
## tempmax 57.709590 1 7.596683
## feelslike 82.967682 1 9.108660
## solarradiation 3.121011 1 1.766638
## humidity 2.006571 1 1.416535
## factor(month) 8.074003 11 1.099591
## precip 1.092009 1 1.044992
Our dataset has many more variables, so here are some ideas on how you can extend your analysis
bikes_hired?Let us say that your best model is model3. How do the
predictions of your model compare to the actual data?
# plot actual data vs predicted data
my_best_model <- broom::augment(model3) %>%
mutate(day=row_number())
# Plot fitted line and residuals
ggplot(my_best_model, aes(x=day, y = bikes_hired)) +
geom_point(alpha = 0.2) +
geom_point(aes(y = .fitted),
shape = 1,
colour = "red", alpha = 0.2)+
theme_bw()As you keep building your models, it makes sense to:
autoplot(model_x). You will
always have some deviation from normality, especially for very high
values of ncar::vif(model_x) to calculate the
Variance Inflation Factor (VIF) for your predictors and
determine whether you have colinear variables. A general guideline is
that a VIF larger than 10 is large, and your model may suffer from
collinearity. Remove the variable in question and run your model again
without it.## temp weekend
## 1.00001 1.00001
## GVIF Df GVIF^(1/(2*Df))
## temp 1.000077 1 1.000038
## wday 1.000077 6 1.000006
## GVIF Df GVIF^(1/(2*Df))
## temp 3.440138 1 1.854761
## wday 1.001375 6 1.000115
## humidity 1.544302 1 1.242700
## factor(month) 3.928781 11 1.064172
Create a summary table, using huxtable (https://am01-sep23.netlify.app/example/modelling_side_by_side_tables/)
that shows which models you worked on, which predictors are significant,
the adjusted \(R^2\), and the Residual
Standard Error. If you want to add more models, just make sure you do
not forget the comma , after the last model, as shown
below
# produce summary table comparing models using huxtable::huxreg()
huxreg(model0, model1, model2, model3, model4,
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma'),
bold_signif = 0.05
) %>%
set_caption('Comparison of models')| (1) | (2) | (3) | (4) | (5) | |
|---|---|---|---|---|---|
| (Intercept) | 26370.172 *** | 14506.378 *** | 15734.845 *** | 15189.803 *** | 38655.890 *** |
| (134.549) | (242.303) | (243.625) | (342.975) | (1086.735) | |
| temp | 1065.598 *** | 1064.468 *** | 1063.559 *** | 549.733 *** | |
| (19.533) | (18.903) | (18.774) | (31.958) | ||
| weekendTRUE | -4254.351 *** | ||||
| (228.899) | |||||
| wdayMon | -859.420 * | -935.139 ** | |||
| (384.250) | (352.685) | ||||
| wdaySat | -2716.075 *** | -2659.042 *** | |||
| (384.114) | (352.555) | ||||
| wdaySun | -4683.707 *** | -4792.364 *** | |||
| (384.248) | (352.689) | ||||
| wdayThu | 1199.987 ** | 1256.269 *** | |||
| (384.247) | (352.670) | ||||
| wdayTue | 1186.822 ** | 1262.607 *** | |||
| (384.247) | (352.686) | ||||
| wdayWed | 1249.237 ** | 1364.426 *** | |||
| (384.247) | (352.694) | ||||
| humidity | -259.360 *** | ||||
| (11.238) | |||||
| factor(month)2 | 414.955 | ||||
| (468.657) | |||||
| factor(month)3 | 1440.898 ** | ||||
| (463.301) | |||||
| factor(month)4 | 1959.576 *** | ||||
| (489.460) | |||||
| factor(month)5 | 4374.431 *** | ||||
| (522.280) | |||||
| factor(month)6 | 5657.200 *** | ||||
| (573.105) | |||||
| factor(month)7 | 5748.443 *** | ||||
| (617.188) | |||||
| factor(month)8 | 2975.771 *** | ||||
| (599.642) | |||||
| factor(month)9 | 4523.377 *** | ||||
| (555.882) | |||||
| factor(month)10 | 4728.243 *** | ||||
| (505.279) | |||||
| factor(month)11 | 3554.632 *** | ||||
| (470.980) | |||||
| factor(month)12 | -1474.351 ** | ||||
| (458.336) | |||||
| #observations | 5084 | 5084 | 5084 | 5084 | 5084 |
| R squared | 0.000 | 0.369 | 0.409 | 0.418 | 0.511 |
| Adj. R Squared | 0.000 | 0.369 | 0.409 | 0.417 | 0.509 |
| Residual SE | 9593.667 | 7619.480 | 7373.692 | 7323.393 | 6721.367 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||||